<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Frequently asked questions [Universal Encoding Detector]</title>
<link rel="stylesheet" href="css/chardet.css" type="text/css">
<link rev="made" href="mailto:mark@diveintomark.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.65.1">
<meta name="keywords" content="character, set, encoding, detection, Python, XML, feed">
<link rel="start" href="index.html" title="Documentation">
<link rel="up" href="index.html" title="Documentation">
<link rel="prev" href="index.html" title="Documentation">
<link rel="next" href="supported-encodings.html" title="Supported encodings">
</head>
<body id="chardet-feedparser-org" class="docs">
<div class="z" id="intro"><div class="sectionInner"><div class="sectionInner2">
<div class="s" id="pageHeader">
<h1><a href="/">Universal Encoding Detector</a></h1>
<p>Character encoding auto-detection in Python.  As smart as your browser.  Open source.</p>
</div>
<div class="s" id="quickSummary"><ul>
<li class="li1">
<a href="http://chardet.feedparser.org/download/">Download</a> ·</li>
<li class="li2">
<a href="index.html">Documentation</a> ·</li>
<li class="li3"><a href="faq.html" title="Frequently Asked Questions">FAQ</a></li>
</ul></div>
</div></div></div>
<div id="main"><div id="mainInner">
<div class="section" lang="en">
<div class="titlepage">
<div><div><h2 class="title">
<a name="faq" class="skip" href="#faq" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Frequently asked questions</h2></div></div>
<div></div>
</div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="faq.intro" class="skip" href="#faq.intro" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> What is character encoding?</h3></div></div>
<div></div>
</div>
<p>When you think of “<span class="quote">text</span>”, you probably think of “<span class="quote">characters and symbols I see on my computer screen</span>”.  But computers don’t deal in characters and symbols; they deal in bits and bytes.  Every piece of text you’ve ever seen on a computer screen is actually stored in a particular <span class="emphasis"><em>character encoding</em></span>.  There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.  Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.</p>
<p>In reality, it’s more complicated than that.  Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk.  So you can think of the character encoding as a kind of decryption key for the text.  Whenever someone gives you a sequence of bytes and claims it’s “<span class="quote">text</span>”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).</p>
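<p>As a small illustration (a sketch using only Python’s built-in codecs; the sample character is an arbitrary choice), the same character is stored as different bytes depending on the encoding, and decoding with the wrong encoding silently produces different characters:</p>

```python
# One character, three encodings, three different byte sequences.
text = "é"

utf8_bytes = text.encode("utf-8")        # two bytes
latin1_bytes = text.encode("latin-1")    # one byte
utf16_bytes = text.encode("utf-16-le")   # two bytes, different from UTF-8

print(utf8_bytes, latin1_bytes, utf16_bytes)

# Decoding with the encoding that was used to write the bytes
# recovers the original character.
assert utf8_bytes.decode("utf-8") == text

# Decoding with a different encoding does not raise an error here;
# it quietly yields two unrelated characters instead.
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

<p>Note that the wrong decoding does not fail.  Nothing in the bytes themselves says which interpretation is the intended one.</p>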
</div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="faq.what" class="skip" href="#faq.what" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> What is character encoding auto-detection?</h3></div></div>
<div></div>
</div>
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text.  It’s like cracking a code when you don’t have the decryption key.</p>
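<p>A deliberately naive sketch of the problem, not the algorithm this library uses: try a handful of candidate encodings and keep those that decode the bytes without error.  The function name <tt class="literal">candidate_encodings</tt> and the candidate list are assumptions made for this example.</p>

```python
def candidate_encodings(data, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return every candidate encoding that decodes `data` without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        survivors.append(name)
    return survivors

mystery = "naïve café".encode("utf-8")
print(candidate_encodings(mystery))
```

<p>Trial decoding rules out <tt class="literal">ascii</tt> here, but several candidates survive, because single-byte encodings accept almost any byte sequence.  Ruling out impossible encodings only narrows the field; choosing among the survivors takes more than that.</p>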
</div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="faq.impossible" class="skip" href="#faq.impossible" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Isn’t that impossible?</h3></div></div>
<div></div>
</div>
<p>In general, yes.  However, some encodings are optimized for specific languages, and languages are not random.  Some character sequences pop up all the time, while other sequences make no sense.  A person fluent in English who opens a newspaper and finds “<span class="quote">txzqJv 2!dasd0a QqdKjvz</span>” will instantly recognize that it isn’t English (even though it is composed almost entirely of English letters).  By studying lots of “<span class="quote">typical</span>” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.</p>
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.</p>
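<p>The idea can be sketched with a toy language model.  This is a simplified illustration, not the statistical model used by the Mozilla code: it decodes the bytes under two Cyrillic encodings and picks the decoding that contains more of the letters that are most frequent in Russian.  The letter set, the function names, and the sample sentence are all assumptions made for this example.</p>

```python
# Roughly the most frequent letters in Russian text; a crude language model.
COMMON_RUSSIAN = set("оеаинтсрвл")

def russian_score(text):
    """Count how many characters are common lowercase Russian letters."""
    return sum(1 for ch in text if ch in COMMON_RUSSIAN)

def guess_cyrillic_encoding(data, candidates=("koi8_r", "cp1251")):
    """Pick the candidate whose decoding looks most like Russian."""
    scores = {}
    for name in candidates:
        try:
            scores[name] = russian_score(data.decode(name))
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding
    return max(scores, key=scores.get) if scores else None

sample = "Привет, как дела".encode("koi8_r")
print(guess_cyrillic_encoding(sample))
```

<p>Both encodings decode the sample without error, so validity alone cannot choose between them.  Only the language-based score tells the readable decoding apart from the scrambled one.</p>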
</div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="faq.who" class="skip" href="#faq.who" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Who wrote this detection algorithm?</h3></div></div>
<div></div>
</div>
<p>This library is a port of <a href="http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/">the auto-detection code in Mozilla</a>.  I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves).  I have also retained the original authors’ comments, which are quite extensive and informative.</p>
<p>You may also be interested in the research paper which led to the Mozilla implementation, <a href="http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html">A composite approach to language/encoding detection</a>.</p>
</div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="faq.yippie" class="skip" href="#faq.yippie" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Yippie!  Screw the standards, I’ll just auto-detect everything!</h3></div></div>
<div></div>
</div>
<p>Don’t do that.  Virtually every format and protocol contains a method for specifying character encoding.</p>
<div class="itemizedlist"><ul>
<li>HTTP can define a <tt class="literal">charset</tt> parameter in the <tt class="literal">Content-Type</tt> header.</li>
<li>HTML documents can define a <tt class="literal">&lt;meta http-equiv="content-type"&gt;</tt> element in the <tt class="literal">&lt;head&gt;</tt> of a web page.</li>
<li>XML documents can define an <tt class="literal">encoding</tt> attribute in the XML declaration.</li>
</ul></div>
<p>If text comes with explicit character encoding information, you should use it.  If the text has no explicit information, but the relevant standard defines a default encoding, you should use that.  (This is harder than it sounds, because standards can overlap.  If you fetch an XML document over HTTP, you need to support both standards <span class="emphasis"><em>and</em></span> figure out which one wins if they give you conflicting information.)</p>
<p>Despite the complexity, it’s worthwhile to follow standards and <a href="http://www.w3.org/2001/tag/doc/mime-respect">respect explicit character encoding information</a>.  It will almost certainly be faster and more accurate than trying to auto-detect the encoding.  It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.</p>
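<p>Reading explicit encoding information is often only a few lines of code.  The following sketch handles just the HTTP case, using the standard library’s <tt class="literal">email.message.Message</tt> to parse the header; the helper name <tt class="literal">charset_from_content_type</tt> is an assumption made for this example, and the fallback value is left to the caller because the right default depends on the relevant standard.</p>

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type header, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)

print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/plain", default="us-ascii"))
```

<p>A fuller version would also inspect the document itself and decide which source wins when the two disagree.</p>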
</div>
<div class="section" lang="en">
<div class="titlepage">
<div><div><h3 class="title">
<a name="faq.why" class="skip" href="#faq.why" title="link to this section"><img src="images/permalink.gif" alt="[link]" title="link to this section" width="8" height="9"></a> Why bother with auto-detection if it’s slow, inaccurate, and non-standard?</h3></div></div>
<div></div>
</div>
<p>Sometimes you receive text with verifiably inaccurate encoding information.  Or you receive text without any encoding information, and the default encoding specified by the relevant standard doesn’t work.  There are also some poorly designed standards that have no way to specify encoding at all.</p>
<p>If following the relevant standards gets you nowhere, <span class="emphasis"><em>and</em></span> you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort.  An example is my <a href="http://feedparser.org/">Universal Feed Parser</a>, which calls this auto-detection library <a href="http://feedparser.org/docs/character-encoding.html">only after exhausting all other options</a>.</p>
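<p>The “<span class="quote">last resort</span>” ordering can be sketched as follows.  This is an illustration of the principle, not the logic Universal Feed Parser actually uses; <tt class="literal">decode_with_fallback</tt> and its parameters are hypothetical names, and <tt class="literal">detector</tt> is assumed to be any callable that takes bytes and returns an encoding name or <tt class="literal">None</tt>.</p>

```python
def decode_with_fallback(data, declared=None, default=None, detector=None):
    """Decode bytes, trying declared, then default, then a detector's guess.

    Returns a (text, encoding_name) pair.
    """
    for name in (declared, default):
        if name is None:
            continue
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding name; try the next source
    # Only now, with the standards exhausted, fall back to guessing.
    if detector is not None:
        guess = detector(data)
        if guess is not None:
            return data.decode(guess), guess
    raise ValueError("could not determine a usable encoding")

data = "café".encode("utf-8")
print(decode_with_fallback(data, declared="ascii", default="utf-8"))
```

<p>With this library installed, a detector along the lines of <tt class="literal">lambda d: chardet.detect(d)["encoding"]</tt> could be passed in; the sketch takes it as a parameter so that the ordering stays visible and the example runs without extra dependencies.</p>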
</div>
</div>
<div class="footernavigation">
<div style="float: left">← <a class="NavigationArrow" href="index.html">Documentation</a>
</div>
<div style="text-align: right">
<a class="NavigationArrow" href="supported-encodings.html">Supported encodings</a> →</div>
</div>
<hr>
<div id="footer"><p class="copyright">Copyright © 2006, 2007, 2008 Mark Pilgrim · <a href="mailto:mark@diveintomark.org">mark@diveintomark.org</a> · <a href="license.html">Terms of use</a></p></div>
</div></div>
</body>
</html>
